# Required packages and libraries
library(tidyverse)
library(lubridate)
library(fpp2)
library(fpp3)
library(tsibble) 
library(tidymodels) 
library(modeltime) 
library(modeltime.ensemble) # ensembles   
library(timetk)
library(ggplot2)
library(ggmap)
library(fable)
library(feasts)
library(fable.prophet)

# To use the tscv package, first run these two commented lines:
# install.packages("devtools")
# devtools::install_github("ahaeusser/tscv")
library(tscv)

Introduction

Description of the data set

First dataset - Traffic Intensity

  • This data set has been downloaded from here.

The visual below gives a detailed description of each variable in the data:

Dataset Information

The data set trafficMAD contains day-ahead traffic data for the city of Madrid in January 2022. It holds one observation every 15 minutes, from 2022-01-01 to 2022-01-31, for 368 measurement zones (id) in Madrid.

setwd("C:/UC3M Master Lectures/TERM 3/Time Series/PART2/Final")
# Around 4000 points in Madrid where traffic is monitored at minutes

trafficMAD <- read.delim("01-2022-traffic.csv", na.strings = "NaN", sep=",")
head(trafficMAD,5)
##     id        long_date elem_type intensity occupation charge vmed error
## 1 1001 01/01/2022 00:00       M30       408          1      0   61     N
## 2 1001 01/01/2022 00:15       M30       156          0      0   69     N
## 3 1001 01/01/2022 00:30       M30       192         NA      0   76     N
## 4 1001 01/01/2022 00:45       M30       600          2      0   67     N
## 5 1001 01/01/2022 01:00       M30      1065          2      0   65     N
##   integration_period
## 1                  5
## 2                  5
## 3                  5
## 4                  5
## 5                  4

The 368 zones (id) of the time series:

trafficMAD %>% 
  group_by(id) %>% summarise(Unique_IDs = n_distinct(id))
## # A tibble: 368 x 2
##       id Unique_IDs
##    <int>      <int>
##  1  1001          1
##  2  1002          1
##  3  1003          1
##  4  1006          1
##  5  1009          1
##  6  1010          1
##  7  1011          1
##  8  1012          1
##  9  1013          1
## 10  1014          1
## # ... with 358 more rows

Second dataset - Location of traffic measurement points

  • Data source: link.

The data set locationMAD contains the location of each traffic measurement zone (each id in the first data set) in Madrid for January 2022. It shares the id and elem_type columns with the first data set. As shown below, it provides the name of each id and the district it belongs to. The utm_x and utm_y columns hold the x and y coordinates of the centroid of the polygon representing each measurement point, and the last two columns, longitude and latitude, give its geographic position.

# adding station information
locationMAD = read.delim('pmed_ubicacion_01-2022.csv', sep=",")
locationMAD$id = as.factor(locationMAD$id)
head(locationMAD,5)
##   elem_type district   id cod_cent
## 1       URB        4 3840     1001
## 2       URB        4 3841     1002
## 3       URB        1 3842     1003
## 4       URB        4 3843     1004
## 5       URB        4 3844     1005
##                                                               name    utm_x
## 1                Jose Ortega y Gasset E-O - Pº Castellana-Serrano 441615.3
## 2                Jose Ortega y Gasset O-E - Serrano-Pº Castellana 441705.9
## 3                               Pº Recoletos N-S - Almirante-Prim 441319.4
## 4                       Pº Recoletos S-N - Pl. Cibeles- Recoletos 441301.6
## 5 (AFOROS) Pº Castellana S-N  - Eduardo Dato - Pl.Emilio Castelar 441605.8
##     utm_y longitude latitude
## 1 4475768 -3.688323 40.43050
## 2 4475770 -3.687256 40.43052
## 3 4474841 -3.691727 40.42213
## 4 4474764 -3.691929 40.42143
## 5 4476132 -3.688470 40.43378

In the data preprocessing section, these two data sets will be merged to attach the location information of each id. The district with the highest traffic intensity will then be identified and analyzed.

Data preprocessing

As mentioned before, the data contains one observation every 15 minutes, so the series has a high frequency and is quite complex. To reduce this complexity, the 15-minute observations will be aggregated to hourly values, which yields a smoother series.
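
As a minimal sketch of this aggregation (toy values, not the real file): each timestamp is truncated to its hour and the four 15-minute readings inside that hour are averaged.

```r
# Toy 15-minute series covering two hours (synthetic values)
stamps <- seq(as.POSIXct("2022-01-01 00:00", tz = "UTC"),
              by = "15 min", length.out = 8)
toy <- data.frame(time = stamps,
                  intensity = c(408, 156, 192, 600, 1065, 1704, 900, 300))

# Truncate each timestamp to its hour, then average within the hour
toy$hour <- format(toy$time, "%Y-%m-%d %H")
hourly <- aggregate(intensity ~ hour, data = toy, FUN = mean)
hourly
# one averaged row per hour: 339.00 and 992.25
```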

Performing hour-by-hour averaging of each day in the month

Before applying any forecasting method, the data has to be prepared: changing the type of some columns and adding columns that capture the relevant timing, such as the hour, the day of the week (e.g. "Monday", "Sunday") and the week number within the month.

## Changing the type of the columns

trafficMAD$id = as.factor(trafficMAD$id)
trafficMAD$elem_type = as.factor(trafficMAD$elem_type)

## Dates
trafficMAD$day = format(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'),'%Y-%m-%d %H') 
trafficMAD$hourly = hour(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'))
trafficMAD$day_of_week = weekdays(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'))
trafficMAD$week_no = week(strptime(trafficMAD$long_date,format='%d/%m/%Y %H:%M'))

head(trafficMAD,6)
##     id        long_date elem_type intensity occupation charge vmed error
## 1 1001 01/01/2022 00:00       M30       408          1      0   61     N
## 2 1001 01/01/2022 00:15       M30       156          0      0   69     N
## 3 1001 01/01/2022 00:30       M30       192         NA      0   76     N
## 4 1001 01/01/2022 00:45       M30       600          2      0   67     N
## 5 1001 01/01/2022 01:00       M30      1065          2      0   65     N
## 6 1001 01/01/2022 01:15       M30      1704          5      0   66     N
##   integration_period           day hourly day_of_week week_no
## 1                  5 2022-01-01 00      0    Saturday       1
## 2                  5 2022-01-01 00      0    Saturday       1
## 3                  5 2022-01-01 00      0    Saturday       1
## 4                  5 2022-01-01 00      0    Saturday       1
## 5                  4 2022-01-01 01      1    Saturday       1
## 6                  5 2022-01-01 01      1    Saturday       1

After these manipulations of the first data set, the modified time series will be merged, by the id column, with the second data set, which holds the location information of each id.
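
A base-R sketch (toy rows) of what merge() does here: rows are matched on id, and a column present in both tables, such as elem_type, is kept twice with .x/.y suffixes, which is why elem_type.x and elem_type.y appear in the merged output.

```r
# Toy versions of the two tables (synthetic rows)
traffic  <- data.frame(id = c(1001, 1001, 1002),
                       elem_type = "M30", intensity = c(408, 156, 700))
location <- data.frame(id = c(1001, 1002),
                       elem_type = "URB", district = c(4, 10))

merged <- merge(traffic, location, by = "id")
names(merged)
# "id" "elem_type.x" "intensity" "elem_type.y" "district"
```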

trafficMAD = merge(trafficMAD, locationMAD, by="id")
head(trafficMAD,5)
##     id        long_date elem_type.x intensity occupation charge vmed error
## 1 1001 01/01/2022 00:00         M30       408          1      0   61     N
## 2 1001 01/01/2022 00:15         M30       156          0      0   69     N
## 3 1001 01/01/2022 00:30         M30       192         NA      0   76     N
## 4 1001 01/01/2022 00:45         M30       600          2      0   67     N
## 5 1001 01/01/2022 01:00         M30      1065          2      0   65     N
##   integration_period           day hourly day_of_week week_no elem_type.y
## 1                  5 2022-01-01 00      0    Saturday       1         M30
## 2                  5 2022-01-01 00      0    Saturday       1         M30
## 3                  5 2022-01-01 00      0    Saturday       1         M30
## 4                  5 2022-01-01 00      0    Saturday       1         M30
## 5                  4 2022-01-01 01      1    Saturday       1         M30
##   district   cod_cent       name  utm_x   utm_y longitude latitude
## 1       10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 2       10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 3       10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 4       10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973
## 5       10 05FT10PM01 05FT10PM01 437146 4473498 -3.740786 40.40973

Sorting the intensity for each district

After attaching the location information of each id, the district column will be used to find the area of Madrid with the highest traffic intensity. As mentioned before, each district contains multiple ids and represents an area of the city. Therefore, it is interesting to do time series forecasting for the district with the highest traffic intensity of all.

## Defining the district column as factor
trafficMAD$district = as.factor(trafficMAD$district)

## Aggregating the 15-minute observations to hourly values per district
trafficMAD %>% drop_na(occupation) %>%
   group_by(district,day) %>% 
   summarise(intensity=mean(intensity, na.rm = FALSE)) %>% 
   mutate(day=as.POSIXct(day, format='%Y-%m-%d %H'))  -> trafficMAD_area
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
trafficMAD_area %>% arrange(desc(intensity))
## # A tibble: 15,623 x 3
## # Groups:   district [21]
##    district day                 intensity
##    <fct>    <dttm>                  <dbl>
##  1 14       2022-01-21 15:00:00     2300.
##  2 14       2022-01-14 15:00:00     2234.
##  3 14       2022-01-28 15:00:00     2198 
##  4 14       2022-01-21 14:00:00     2175.
##  5 14       2022-01-28 14:00:00     2137.
##  6 14       2022-01-14 14:00:00     2057.
##  7 10       2022-01-03 08:00:00     2041.
##  8 14       2022-01-24 14:00:00     2023.
##  9 14       2022-01-20 14:00:00     2006.
## 10 14       2022-01-20 15:00:00     2006 
## # ... with 15,613 more rows

After sorting the intensity in descending order, it can be seen that the heaviest traffic occurred in district 14, even though there are some high intensity values in other districts as well. Therefore, this district is a good candidate for analysis.
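
Single sorted rows can be dominated by one-off peaks; a more robust check is to rank districts by their average intensity, as in this base-R sketch on toy numbers:

```r
# Toy hourly intensities for three districts (synthetic values)
toy <- data.frame(district  = c("10", "14", "14", "10", "7"),
                  intensity = c(2041, 2300, 2234, 900, 500))

# Average per district, then sort descending
ranking <- aggregate(intensity ~ district, data = toy, FUN = mean)
ranking <- ranking[order(-ranking$intensity), ]
ranking$district[1]
# "14"
```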

The intensity of each district can also be visualized as below to see the hourly pattern clearly.

p <- ggplot(data = trafficMAD_area, 
            aes(x = day, y = intensity, color = district)) + 
     geom_line() + ggtitle("Traffic Intensity in each district", 
                           subtitle = "2022-01-01 to 2022-01-31")
p + facet_wrap(~district)

It is clear that district 14 had the highest traffic intensity in January 2022.

Choosing the highest intensity area in Madrid to analyze

Here, district 14 will be filtered from the trafficMAD data.

district_id =   14 # district 14 

trafficMAD %>% 
  filter(district==district_id) %>%
  ## converting the 'day' column to POSIXct so it can be used as a time index later
  mutate(day=as.POSIXct(day, format='%Y-%m-%d %H')) -> trafficMAD_14
head(trafficMAD_14,5)
##     id        long_date elem_type.x intensity occupation charge vmed error
## 1 3495 01/01/2022 00:00         M30       204          0      5   93     N
## 2 3495 01/01/2022 00:15         M30       173          0      2   91     N
## 3 3495 01/01/2022 00:30         M30       492          0      9   84     N
## 4 3495 01/01/2022 00:45         M30      1060          1     22   81     N
## 5 3495 01/01/2022 01:00         M30      1256          2     25   81     N
##   integration_period                 day hourly day_of_week week_no elem_type.y
## 1                 15 2022-01-01 00:00:00      0    Saturday       1         M30
## 2                 15 2022-01-01 00:00:00      0    Saturday       1         M30
## 3                 15 2022-01-01 00:00:00      0    Saturday       1         M30
## 4                 15 2022-01-01 00:00:00      0    Saturday       1         M30
## 5                 15 2022-01-01 01:00:00      1    Saturday       1         M30
##   district cod_cent    name    utm_x   utm_y longitude latitude
## 1       14  PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 2       14  PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 3       14  PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 4       14  PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 5       14  PM30752 PM30752 444467.3 4474308 -3.654574 40.41754

The IDs in district 14

trafficMAD_14 %>% 
  group_by(id,name) %>% summarise(Unique_IDs = n_distinct(id))
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 2 x 3
## # Groups:   id [2]
##   id    name    Unique_IDs
##   <fct> <chr>        <int>
## 1 3495  PM30752          1
## 2 3561  PM30753          1

Plotting the ids on the Madrid map

Here, the measurement points of district 14 will be plotted using the longitude and latitude of ids 3495 and 3561.

ids = c(3495, 3561) 
locationMAD %>% 
  filter(id %in% ids) -> coordinates  # %in%, since == would recycle the vector
coordinates
##   elem_type district   id cod_cent    name    utm_x   utm_y longitude latitude
## 1       M30       14 3495  PM30752 PM30752 444467.3 4474308 -3.654574 40.41754
## 2       M30       14 3561  PM30753 PM30753 444465.8 4474300 -3.654592 40.41747
# getting the map
map14 <- get_map(location = c(longitude = mean(coordinates$longitude), latitude = mean(coordinates$latitude)), zoom = 12,
                      maptype = "roadmap", scale = 2)

# plotting the map with the measurement points on it
ggmap(map14) +
  geom_point(data = coordinates, aes(x = longitude, y = latitude),
             fill = "darkred", alpha = 0.8, size = 8, shape = 21)

As seen after taking the distinct IDs in district 14, there are two ids: 3495 and 3561. Therefore, these are the measurement points I am going to analyze.

Aggregating the minute frequency into hour

After filtering the busiest district of Madrid (district 14, with ids 3495 and 3561), the frequency will be changed from 15 minutes to one hour to smooth the series. The output is an "hourly profile" of the entire month, with a seasonal period of 24 lags, because there are 24 hours in a day.

## Aggregating the minute frequency into hour
trafficMAD_14 %>% drop_na(occupation,vmed) %>% 
   group_by(district,day) %>% 
   summarise(occupation=median(occupation, na.rm = FALSE),
             intensity=mean(intensity, na.rm = FALSE),
             vmed=mean(vmed, na.rm = FALSE))  -> trafficMAD_14_hr
## `summarise()` has grouped output by 'district'. You can override using the
## `.groups` argument.
trafficMAD_14_hr
## # A tibble: 744 x 5
## # Groups:   district [1]
##    district day                 occupation intensity  vmed
##    <fct>    <dttm>                   <dbl>     <dbl> <dbl>
##  1 14       2022-01-01 00:00:00        0        288.  67.1
##  2 14       2022-01-01 01:00:00        1.5      863.  87  
##  3 14       2022-01-01 02:00:00        1.5      696.  81.1
##  4 14       2022-01-01 03:00:00        0.5      421.  67  
##  5 14       2022-01-01 04:00:00        0        288.  67  
##  6 14       2022-01-01 05:00:00        0.5      258.  57.4
##  7 14       2022-01-01 06:00:00        0.5      340   61.8
##  8 14       2022-01-01 07:00:00        0.5      381.  74.4
##  9 14       2022-01-01 08:00:00        0        342.  77.6
## 10 14       2022-01-01 09:00:00        0        289.  68.9
## # ... with 734 more rows

After obtaining the hourly multivariate time series, the three variables of interest (intensity, occupation and vmed) will be plotted.

trafficMAD_14_hr %>%
  pivot_longer(c(3:5), names_to = "var", values_to = "value") %>%
  ggplot(aes(x = day, y = value, group = 1)) +
  geom_line() +
  facet_grid(vars(var),scales = "free") +
  labs(title = "Madrid Traffic in district 14 (in hour)",
       y = "values")

As seen in the time series above, the hourly mean of intensity and the hourly median of occupation follow almost the same pattern. This makes sense, because occupation is an indicator of the traffic intensity level in the data. In the VAR model, these three variables will be analyzed in detail.

Statistical Tools

Vector autoregression (VAR)

The vector autoregressive (VAR) model is a multivariate time series model that relates current observations of a variable to past observations of itself and past observations of the other variables. Therefore, in this part, I am going to analyze the three variables in the data jointly.
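
For reference, a VAR(p) for the k = 3 series studied here (occupation, intensity, vmed) can be written as:

```latex
% y_t = (occupation_t, intensity_t, vmed_t)'
y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_p y_{t-p} + \varepsilon_t
```

where c is a 3x1 intercept vector, each A_i is a 3x3 coefficient matrix, and the error term is multivariate white noise.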

Converting to a tsibble time series with the as_tsibble function

To do that, the tibble must be converted to tsibble format.

#trafficMAD_14_hr$occupation <- NULL

## Converting to a tsibble
traffic_tsbl = as_tsibble(trafficMAD_14_hr, index = day)

class(traffic_tsbl)
## [1] "grouped_ts" "grouped_df" "tbl_ts"     "tbl_df"     "tbl"       
## [6] "data.frame"
traffic_tsbl
## # A tsibble: 744 x 5 [1h] <?>
## # Groups:    district [1]
##    district day                 occupation intensity  vmed
##    <fct>    <dttm>                   <dbl>     <dbl> <dbl>
##  1 14       2022-01-01 00:00:00        0        288.  67.1
##  2 14       2022-01-01 01:00:00        1.5      863.  87  
##  3 14       2022-01-01 02:00:00        1.5      696.  81.1
##  4 14       2022-01-01 03:00:00        0.5      421.  67  
##  5 14       2022-01-01 04:00:00        0        288.  67  
##  6 14       2022-01-01 05:00:00        0.5      258.  57.4
##  7 14       2022-01-01 06:00:00        0.5      340   61.8
##  8 14       2022-01-01 07:00:00        0.5      381.  74.4
##  9 14       2022-01-01 08:00:00        0        342.  77.6
## 10 14       2022-01-01 09:00:00        0        289.  68.9
## # ... with 734 more rows

Here, for training, the data will be filtered to the first 29 days, leaving the last 2 of the 31 days for evaluation.

model_var <-  traffic_tsbl %>% 
  filter(day(day)<30) %>%
  model(aicc=VAR(vars(occupation,intensity,vmed), season=24),
        bic=VAR(vars(occupation,intensity,vmed), ic="bic",season=24)
  )
glance(model_var)
## # A tibble: 2 x 6
##   .model sigma2        log_lik    AIC   AICc    BIC
##   <chr>  <list>          <dbl>  <dbl>  <dbl>  <dbl>
## 1 aicc   <dbl [3 x 3]>  -7109. 14332. 14342. 14590.
## 2 bic    <dbl [3 x 3]>  -7132. 14361. 14368. 14579.

When comparing the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC), the BIC penalizes additional parameters more strongly than the AIC (for instance, occupation carries much the same information as intensity, so extra parameters add little). Since the bic model has the lower BIC score, it is the one that will be analyzed.
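
For a model with k estimated parameters fitted to T observations, the two criteria differ only in their penalty term:

```latex
\mathrm{AIC} = -2\log L + 2k, \qquad \mathrm{BIC} = -2\log L + k\log T
```

Since log T exceeds 2 whenever T > e^2 (about 7.4 observations), and the training sample here has 29 x 24 = 696 hours, the BIC penalty is clearly the heavier one.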

report(model_var$bic[[1]])
## Series: occupation, intensity, vmed 
## Model: VAR(4) w/ mean 
## 
## Coefficients for occupation:
##       lag(occupation,1)  lag(intensity,1)  lag(vmed,1)  lag(occupation,2)
##                 -0.0097            0.0049      -0.0159             0.2695
## s.e.             0.0697            0.0003       0.0040             0.0705
##       lag(intensity,2)  lag(vmed,2)  lag(occupation,3)  lag(intensity,3)
##                -0.0029       0.0115             0.0104            0.0011
## s.e.            0.0003       0.0050             0.0716            0.0004
##       lag(vmed,3)  lag(occupation,4)  lag(intensity,4)  lag(vmed,4)  constant
##           -0.0106             0.0696           -0.0011       0.0044    0.4861
## s.e.       0.0049             0.0709            0.0003       0.0039    0.2086
## 
## Coefficients for intensity:
##       lag(occupation,1)  lag(intensity,1)  lag(vmed,1)  lag(occupation,2)
##                -41.0141            1.4622      -1.0922            81.0893
## s.e.            18.9054            0.0756       1.0846            19.1342
##       lag(intensity,2)  lag(vmed,2)  lag(occupation,3)  lag(intensity,3)
##                -0.7903       2.6844            32.5909            0.1747
## s.e.            0.0937       1.3438            19.4187            0.0978
##       lag(vmed,3)  lag(occupation,4)  lag(intensity,4)  lag(vmed,4)  constant
##           -2.4583            25.5861           -0.3123       0.4117  234.5879
## s.e.       1.3383            19.2244            0.0887       1.0545   56.5754
## 
## Coefficients for vmed:
##       lag(occupation,1)  lag(intensity,1)  lag(vmed,1)  lag(occupation,2)
##                 -2.7913            0.0206       0.7123             0.7374
## s.e.             0.7511            0.0030       0.0431             0.7602
##       lag(intensity,2)  lag(vmed,2)  lag(occupation,3)  lag(intensity,3)
##                -0.0040      -0.0340             2.1619           -0.0033
## s.e.            0.0037       0.0534             0.7715            0.0039
##       lag(vmed,3)  lag(occupation,4)  lag(intensity,4)  lag(vmed,4)  constant
##           -0.0131            -0.3487           -0.0054      -0.1294   30.3338
## s.e.       0.0532             0.7637            0.0035       0.0419    2.2476
## 
## Residual covariance matrix:
##            occupation  intensity     vmed
## occupation     0.4549   102.1403   1.3679
## intensity    102.1403 33466.0145 614.2559
## vmed           1.3679   614.2559  52.8196
## 
## log likelihood = -7132.38
## AIC = 14360.76   AICc = 14368.07 BIC = 14578.66

Let’s analyze the residuals. Here, the .innov column is used, which contains the innovation residuals. In this case, since no transformation has been applied, the innovation residuals are identical to the regular residuals.

Residuals are useful for checking whether a model has adequately captured the information in the data. For this reason, the innovation residuals will be inspected with ACF(.innov).
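
Beyond eyeballing the ACF, a portmanteau test puts a number on "no significant correlation". Base R's Box.test() implements the Ljung-Box test; the sketch below runs it on simulated white noise as a stand-in for the .innov column:

```r
set.seed(1)
resid_sim <- rnorm(696)  # stand-in for the innovation residuals (29 days x 24 h)

# Ljung-Box test: H0 = no autocorrelation up to the chosen lag
bt <- Box.test(resid_sim, lag = 24, type = "Ljung-Box")
bt$p.value
# a large p-value means no evidence of leftover autocorrelation
```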

model_var %>%
  select(bic) %>%
  augment() %>%
  ACF(.innov) %>%
  autoplot()

These three ACF plots show that the mean of the residuals is close to zero and that there is no significant autocorrelation at most lags. The variation of the residuals stays roughly constant across the sample; however, the clear spikes at lag 24 indicate that some daily (24-hour) seasonality remains in the residuals.

Forecasting

Here, the following two days (24 + 24 = 48 hours) will be forecast.

model_var%>%
  select(bic) %>%
  forecast(h=48) %>% 
    autoplot(traffic_tsbl, size=2)
## `mutate_if()` ignored the following grouping variables:
## * Column `district`

As seen in the results above, intensity and occupation behave almost identically. Since the aim is to predict traffic congestion, the response variable should be intensity. Therefore, traffic intensity will be forecast first, and then a dynamic regression model will be applied to analyze the two most relevant variables, vmed and intensity.

Seasonal ARIMA model

In this part, the traffic intensity in the busiest area of Madrid will first be predicted with a manually specified seasonal ARIMA model.

# Traffic congestion in the 14th area
traffic_tsbl %>% ggplot(aes(x=day , y=intensity)) + geom_line(col="deepskyblue2",size=1.2) +
  labs(title = 'Monthly Traffic Congestion in the district-14', subtitle="January, 2022", x = '', y = 'intensity') + theme_minimal()

As seen in the plot above, the traffic intensity in the busiest area of Madrid tends to increase from the beginning to the end of January.

Residuals of Intensity variable

traffic_tsbl %>% gg_tsdisplay(y=intensity, plot_type = "partial", lag=24)

As seen above, the ACF/PACF plots of the time series provide some insight into possible models. They indicate that a model with seasonal period 24 may be suitable, since the series shows clear seasonality and is not stationary. Therefore, it would be good to fit a seasonal ARIMA model. To specify it, the PACF is used to select the non-seasonal part of the model and the ACF the seasonal part. Based on these plots, ARIMA(1,0,1)(2,1,0) looks like a good candidate for this data. Later, this model will be tested against auto-ARIMA to see whether it is the best choice.
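
In backshift notation (B y_t = y_{t-1}), the candidate ARIMA(1,0,1)(2,1,0)[24] reads:

```latex
(1 - \phi_1 B)\,(1 - \Phi_1 B^{24} - \Phi_2 B^{48})\,(1 - B^{24})\, y_t
  = (1 + \theta_1 B)\,\varepsilon_t
```

that is, one non-seasonal AR and MA term, two seasonal AR terms, and one seasonal difference at period 24.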

Seasonal ARIMA Model - Manual

model_arima_manuel <- traffic_tsbl %>% 
  filter(day(day)<30) %>%
  model(ARIMA(intensity ~ pdq(p=1, d=0, q=1) + PDQ(P=2, D=1, Q=0)))

report(model_arima_manuel)
## Series: intensity 
## Model: ARIMA(1,0,1)(2,1,0)[24] 
## 
## Coefficients:
##          ar1     ma1     sar1     sar2
##       0.8020  0.1798  -0.4142  -0.2633
## s.e.  0.0267  0.0445   0.0396   0.0388
## 
## sigma^2 estimated as 18205:  log likelihood=-4251.26
## AIC=8512.51   AICc=8512.6   BIC=8535.06

The AIC is about 8512 and the BIC about 8535. Let’s analyze the residuals of the fitted ARIMA model.

model_arima_manuel %>% gg_tsresiduals(lag=24)

As seen above, the residuals of the fitted model look stationary, with no significant autocorrelation, which suggests this model is a good fit for the data set.

Forecasting with the manual ARIMA

Here, the next 48 hours (2 days) will be predicted.

fc = model_arima_manuel %>% forecast(h = 48)

fc %>% 
  autoplot(traffic_tsbl, size=1.5) +
  labs(title = "Traffic Intensity",subtitle='Hourly Forecasts for the 2 days ahead', x='', y = '') + theme_minimal()
## `mutate_if()` ignored the following grouping variables:
## * Column `district`

Evaluation Metrics for the Test set

estimations <- forecast(model_arima_manuel,h = 48)
test <-  trafficMAD_14_hr%>% 
           filter(day(day)>=30)

rmse <- function (x, y) 
{
  RMSE <- sqrt(mean((y - x)^2))
  return(RMSE)
}
glue::glue('RMSE: ', rmse(estimations$.mean,test$intensity))
## RMSE: 223.77997580009

The RMSE on the test set is about 223. To see other evaluation metrics, the accuracy function from the forecast package will be used.

forecast::accuracy(estimations$.mean,test$intensity)
##                 ME   RMSE      MAE       MPE     MAPE
## Test set -44.62964 223.78 187.7288 -23.18167 35.44874

For the ARIMA(1,0,1)(2,1,0)[24] model, the test RMSE is about 223 and the MAE about 187. To check whether this model is the best, auto-ARIMA will be applied in the automatic forecasting part.

Statistical Tools with time series cross validation

Automatic Forecasting Models

In this part, arima, ets, tslm and prophet models will be tried in turn. The manual seasonal ARIMA model is also included to see whether it remains the best choice. Models will be trained and tested with time series cross-validation in sliding-window mode to assess and improve model performance.

# Hourly time series plot
traffic_tsbl %>%
  plot_time_series(.date_var=day, .value=intensity, .interactive = TRUE,
                   .title='Traffic intensity in Madrid for the district 14 in January, 2022',
                   .y_lab = "Monthly data in hour") 

Seasonal decomposition with timetk:

traffic_tsbl %>%
    plot_stl_diagnostics(day, intensity,
        .frequency = "auto", .trend = "auto",
        .feature_set = c("observed", "season", "trend", "remainder"),
        .interactive = T)

In the STL diagnostics, the daily seasonality can be seen clearly. Also, as mentioned before, the hourly traffic intensity series has an upward trend until the end of the month.

Time-series cross-validation

To prepare the data set for time series cross-validation, the tscv package will be used. Its function make_split() splits the data into several slices for training and testing. For the split windows, two modes are available: stretch and slide. The first is an expanding-window approach, while the latter uses a fixed window. For this project, the slide mode will be used, because real-world time series are typically non-stationary, and with non-stationary series it is usually better to use a fixed training window than an expanding one.

The training window will contain 600 observations, and in each testing window 24 hours will be predicted. Since it is an hourly series, n_lag is 0 and the step size for increments (n_skip) is set to 23.
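
The exact stepping logic is internal to tscv, but the six folds can be reproduced with a base-R sketch, assuming each fold advances by n_skip + 1 observations:

```r
n_obs   <- 744        # hourly observations in January (31 x 24)
n_train <- 600        # fixed training window ("slide" mode)
n_ahead <- 24         # forecast horizon
n_skip  <- 23         # observations skipped between folds
step    <- n_skip + 1 # assumed increment between consecutive folds

starts <- seq(1, n_obs - n_train - n_ahead + 1, by = step)
splits <- data.frame(split      = seq_along(starts),
                     train_from = starts,
                     train_to   = starts + n_train - 1,
                     test_from  = starts + n_train,
                     test_to    = starts + n_train + n_ahead - 1)
nrow(splits)  # 6 folds; the last test window ends at observation 744
```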

series_id = "district"
value_id = "intensity"
index_id = "day"

context <- list(
  series_id = series_id,
  value_id = value_id,
  index_id = index_id
)

# Setup for time series cross validation
type = "first" 
value = 600       # size for training window 
n_ahead = 24      # size for testing window (= forecast horizon)
n_skip = 23       # skip 23 observations because it is an hourly data
n_lag = 0         # no lag
mode = "slide"    # fixed window approach
exceed = FALSE    

split_frame <- make_split(
  main_frame = trafficMAD_14_hr,
  context = context,
  type = type,
  value = value,
  n_ahead = n_ahead,
  n_skip = n_skip,
  n_lag = n_lag,
  mode = mode,
  exceed = exceed
)

split_frame
## # A tibble: 6 x 4
##   district split train       test      
##   <fct>    <int> <list>      <list>    
## 1 14           1 <int [600]> <int [24]>
## 2 14           2 <int [600]> <int [24]>
## 3 14           3 <int [600]> <int [24]>
## 4 14           4 <int [600]> <int [24]>
## 5 14           5 <int [600]> <int [24]>
## 6 14           6 <int [600]> <int [24]>

By defining these settings, the dataset has been split into 6 folds.

From these splits, train and test data frames will be created with the slice_train and slice_test functions. Since split_frame only stores the indices, these functions simply extract the corresponding rows from the main trafficMAD_14_hr data.

# Slice training data from main_frame according to split_frame
train_frame <- slice_train(
  main_frame = trafficMAD_14_hr,
  split_frame = split_frame,
  context = context
  )

head(train_frame,5)
## # A tibble: 5 x 6
## # Groups:   district [1]
##   district day                 occupation intensity  vmed split
##   <fct>    <dttm>                   <dbl>     <dbl> <dbl> <int>
## 1 14       2022-01-01 00:00:00        0        288.  67.1     1
## 2 14       2022-01-01 01:00:00        1.5      863.  87       1
## 3 14       2022-01-01 02:00:00        1.5      696.  81.1     1
## 4 14       2022-01-01 03:00:00        0.5      421.  67       1
## 5 14       2022-01-01 04:00:00        0        288.  67       1

After creating the train_frame tibble, it should be converted to tsibble format.

# Convert tibble to tsibble
train_frame_tsbl <- train_frame %>%
                  as_tsibble(key = c(district, split),index = day)
head(train_frame_tsbl,5)
## # A tsibble: 5 x 6 [1h] <?>
## # Key:       district, split [1]
## # Groups:    district [1]
##   district day                 occupation intensity  vmed split
##   <fct>    <dttm>                   <dbl>     <dbl> <dbl> <int>
## 1 14       2022-01-01 00:00:00        0        288.  67.1     1
## 2 14       2022-01-01 01:00:00        1.5      863.  87       1
## 3 14       2022-01-01 02:00:00        1.5      696.  81.1     1
## 4 14       2022-01-01 03:00:00        0.5      421.  67       1
## 5 14       2022-01-01 04:00:00        0        288.  67       1

Every step used to create train_frame is repeated to define test_frame.

# Slice test data from main_frame according to split_frame
test_frame <- slice_test(
  main_frame = trafficMAD_14_hr,
  split_frame = split_frame,
  context = context
  )

test_frame_tsbl <- test_frame %>% as_tsibble(key = c(district, split),index = day)
head(test_frame_tsbl,5)
## # A tsibble: 5 x 6 [1h] <?>
## # Key:       district, split [1]
## # Groups:    district [1]
##   district day                 occupation intensity  vmed split
##   <fct>    <dttm>                   <dbl>     <dbl> <dbl> <int>
## 1 14       2022-01-26 00:00:00          1     328.   64       1
## 2 14       2022-01-26 01:00:00          0     156.   49.1     1
## 3 14       2022-01-26 02:00:00          0      99    41.4     1
## 4 14       2022-01-26 03:00:00          0      95.6  39.4     1
## 5 14       2022-01-26 04:00:00          0     125.   44.1     1

Forecasting Fable auto models with time series cross validation

model_fable_auto <- train_frame_tsbl %>% 

  model(arima_manuel = ARIMA(intensity ~ pdq(p=1, d=0, q=1) + PDQ(P=2, D=1, Q=0)),
        arima = ARIMA(intensity), 
        ets = ETS(intensity ~ season(c("A","M"), period=24)),
        lm = TSLM(intensity ~ I(day(day)^2) + season(period=24)),
        prophet = prophet(intensity ~ season(period = 24))
)

Training Accuracy

fabletools::accuracy(model_fable_auto)
## # A tibble: 30 x 12
##    district split .model    .type        ME  RMSE   MAE    MPE  MAPE  MASE RMSSE
##    <fct>    <int> <chr>     <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
##  1 14           1 arima_ma~ Trai~  4.65e+ 0  135.  85.3 -11.5   23.0 0.484 0.492
##  2 14           1 arima     Trai~  4.65e+ 0  135.  85.3 -11.5   23.0 0.484 0.492
##  3 14           1 ets       Trai~  2.67e- 1  127.  88.5  -9.76  22.3 0.502 0.463
##  4 14           1 lm        Trai~ -1.99e-14  234. 181.  -26.2   45.3 1.02  0.850
##  5 14           1 prophet   Trai~ -2.35e- 1  237. 183.  -27.7   48.8 1.04  0.862
##  6 14           2 arima_ma~ Trai~  4.20e+ 0  131.  82.5 -11.3   21.7 0.491 0.486
##  7 14           2 arima     Trai~  4.20e+ 0  131.  82.5 -11.3   21.7 0.491 0.486
##  8 14           2 ets       Trai~  1.44e- 1  122.  84.8  -9.68  21.5 0.504 0.453
##  9 14           2 lm        Trai~ -5.21e-15  216. 166.  -24.1   42.8 0.984 0.803
## 10 14           2 prophet   Trai~  7.95e- 2  224. 171.  -25.5   45.9 1.02  0.835
## # ... with 20 more rows, and 1 more variable: ACF1 <dbl>
fabletools::accuracy(model_fable_auto) %>%
  plot_line(
    x = split,
    y = MAE,
    facet_var = .type,
    facet_nrow = 1,
    color = .model,
    title = "Evaluation of forecast accuracy by split",
    subtitle = "Mean absolute error (MAE)",
    xlab = "Split"
    )

As seen from the training accuracies, the lowest training error was obtained with the arima and arima_manuel models. Since their error rates are identical, the model selected by automatic ARIMA must be the same as the manually defined one (this is analyzed in detail in the following part). The ets model comes second, while the lm and prophet models performed poorly in every split.
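This ranking can be checked numerically by averaging the training MAE over the six splits (a small sketch; it assumes the model_fable_auto mable fitted above):

```r
# Average training MAE per model across all splits
# (sketch; assumes model_fable_auto from the chunk above)
fabletools::accuracy(model_fable_auto) %>%
  group_by(.model) %>%
  summarise(avg_mae = mean(MAE)) %>%
  arrange(avg_mae)
```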

Therefore, let’s analyze the best two models, arima and ets.

model_fable_auto$arima
## <lst_mdl[6]>
## [1] <ARIMA(1,0,1)(2,1,0)[24]> <ARIMA(1,0,1)(2,1,0)[24]>
## [3] <ARIMA(1,0,1)(2,1,0)[24]> <ARIMA(1,0,1)(2,1,0)[24]>
## [5] <ARIMA(2,0,0)(1,1,0)[24]> <ARIMA(1,0,1)(2,1,0)[24]>

As seen in the above results, only in the 5th split was a different model, ARIMA(2,0,0)(1,1,0)[24], selected. For the remaining splits, ARIMA(1,0,1)(2,1,0)[24] was chosen, which is identical to the manually defined seasonal ARIMA model.

Let’s analyze the residuals of the training split with the lowest MAE, which is split 4.

model_fable_auto %>% select(arima) %>% filter(split==4) %>% gg_tsresiduals(lag=24)

The residuals appear stationary and close to white noise.
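Beyond the visual check, the whiteness of the residuals can be tested with a Ljung-Box test (a sketch, assuming the mable above; dof = 4 matches the four estimated coefficients of ARIMA(1,0,1)(2,1,0)[24]):

```r
# Ljung-Box test on the innovation residuals of the split-4 ARIMA fit;
# a large p-value is consistent with white-noise residuals
model_fable_auto %>%
  select(arima) %>%
  filter(split == 4) %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 4)
```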

model_fable_auto$ets
## <lst_mdl[6]>
## [1] <ETS(A,N,A)> <ETS(A,N,A)> <ETS(A,N,A)> <ETS(A,N,A)> <ETS(A,N,A)>
## [6] <ETS(A,N,A)>

As seen in the above results, the ETS(A,N,A) model was selected for all splits, so this model will be fitted to the data. An ETS model combines a “level”, a “trend” (slope) and a “seasonal” component to describe a time series; in this case, the selected specification has Additive errors, No trend and Additive seasonality.
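These components can be inspected directly by decomposing the fitted ETS model, for example for split 4 (a sketch assuming the model_fable_auto mable above):

```r
# Decompose the fitted ETS(A,N,A) model into level and seasonal components
model_fable_auto %>%
  select(ets) %>%
  filter(split == 4) %>%
  components() %>%
  autoplot()
```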

Let’s analyze the residuals of the training split with the lowest MAE, which is split 4.

model_fable_auto %>% select(ets) %>% filter(split==4) %>% gg_tsresiduals(lag=24)

The ACF plot shows that the residuals of the ARIMA fit look closer to white noise than those of the ETS fit.

Forecasting

# Forecasting
fable_frame <- model_fable_auto %>%
  forecast(h = n_ahead) ## estimating the next 24 hours

fable_frame
## # A fable: 720 x 6 [1h] <?>
## # Key:     district, split, .model [30]
##    district split .model       day                      intensity .mean
##    <fct>    <int> <chr>        <dttm>                      <dist> <dbl>
##  1 14           1 arima_manuel 2022-01-26 00:00:00  N(377, 19176)  377.
##  2 14           1 arima_manuel 2022-01-26 01:00:00  N(222, 36732)  222.
##  3 14           1 arima_manuel 2022-01-26 02:00:00  N(148, 47981)  148.
##  4 14           1 arima_manuel 2022-01-26 03:00:00  N(140, 55188)  140.
##  5 14           1 arima_manuel 2022-01-26 04:00:00  N(177, 59807)  177.
##  6 14           1 arima_manuel 2022-01-26 05:00:00  N(253, 62766)  253.
##  7 14           1 arima_manuel 2022-01-26 06:00:00  N(538, 64662)  538.
##  8 14           1 arima_manuel 2022-01-26 07:00:00  N(719, 65877)  719.
##  9 14           1 arima_manuel 2022-01-26 08:00:00 N(1030, 66655) 1030.
## 10 14           1 arima_manuel 2022-01-26 09:00:00  N(976, 67154)  976.
## # ... with 710 more rows
# Converting fable_frame to future_frame (tibble)
future_frame <- make_future(
  fable = fable_frame,
  context = context
  )

future_frame
## # A tibble: 720 x 6
##    day                 district model        split horizon point
##    <dttm>              <fct>    <chr>        <int>   <int> <dbl>
##  1 2022-01-26 00:00:00 14       arima_manuel     1       1  377.
##  2 2022-01-26 01:00:00 14       arima_manuel     1       2  222.
##  3 2022-01-26 02:00:00 14       arima_manuel     1       3  148.
##  4 2022-01-26 03:00:00 14       arima_manuel     1       4  140.
##  5 2022-01-26 04:00:00 14       arima_manuel     1       5  177.
##  6 2022-01-26 05:00:00 14       arima_manuel     1       6  253.
##  7 2022-01-26 06:00:00 14       arima_manuel     1       7  538.
##  8 2022-01-26 07:00:00 14       arima_manuel     1       8  719.
##  9 2022-01-26 08:00:00 14       arima_manuel     1       9 1030.
## 10 2022-01-26 09:00:00 14       arima_manuel     1      10  976.
## # ... with 710 more rows

Estimating accuracy metrics by forecast horizon

In this part, the forecast errors will be plotted for each step of the forecast horizon, to see at which points the models’ predictions carry the highest error.

accuracy_horizon <- make_accuracy(
  future_frame = future_frame,
  main_frame = trafficMAD_14_hr,
  context = context,
  dimension = "horizon"
)

# Visualize results
accuracy_horizon %>%
  filter(metric == "MAE") %>%
  plot_line(
    x = n,
    y = value,
    facet_var = district,
    facet_nrow = 1,
    color = model,
    title = "Evaluation of forecast accuracy by forecast horizon",
    subtitle = "Mean absolute error (MAE)",
    xlab = "Forecast horizon (n-step-ahead)"
    )

As seen in the above results, the forecast errors were plotted for each hour of the horizon. In general, every model’s forecast accuracy deteriorates when predicting the 5-10 hour and 15-20 hour periods. On the other hand, the beginning of the day (night time) is predicted with the lowest error by every model. Overall, the arima and ets models predict best.

Forecast accuracy by split

# Estimate accuracy metrics by split
accuracy_split <- make_accuracy(
  future_frame = future_frame,
  main_frame = trafficMAD_14_hr,
  context = context,
  dimension = "split"
)

# Visualize results
accuracy_split %>%
  filter(metric == "MAE") %>%
  plot_line(
    x = n,
    y = value,
    facet_var = district,
    facet_nrow = 1,
    color = model,
    title = "Evaluation of forecast accuracy by split",
    subtitle = "Mean absolute error (MAE)",
    xlab = "Split"
    )

The above visual plots the MAE for each split. The lowest MAE was obtained by both the arima and arima_manuel models. They coincide because automatic ARIMA selected the same seasonal ARIMA model; that is why the red line for auto arima is only visible between splits 4 and 6, since the selected model changes in split 5 (elsewhere the two lines overlap).

Therefore, I am going to visualize the traffic intensity prediction for split 2, as it has the lowest error.

# It is convenient to define a training set
model_arima <- train_frame_tsbl %>% 
  filter(split==2) %>%
  model(arima = ARIMA(intensity))
model_arima$arima
## <lst_mdl[1]>
## [1] <ARIMA(1,0,1)(2,1,0)[24]>

Forecasting

Here, the hourly traffic intensity for the next 24 hours will be predicted.

fc2 <- model_arima %>% forecast(h = n_ahead)
fc2 %>% 
  autoplot(rbind(train_frame_tsbl %>% filter(split==2), test_frame_tsbl %>% filter(split==2)),
           size = 1.5, level = NULL) +
  labs(title = "Hourly Traffic Intensity", subtitle = '2022-01-01 to 2022-01-31', x = '', y = '') +
  theme_minimal()
## `mutate_if()` ignored the following grouping variables:
## * Column `district`

Prediction Intervals

pred_interval <- fc2 %>% hilo(level = c(80, 95))
pred_interval %>% filter(.model=="arima") -> pred_interval_arima
pred_interval_arima$`95%`
## <hilo[24]>
##  [1] [  90.17934,  614.3613]95 [-182.36791,  543.5951]95
##  [3] [-289.85891,  539.1362]95 [-329.38761,  558.8603]95
##  [5] [-311.06216,  612.8469]95 [-206.38509,  739.4750]95
##  [7] [  99.85046, 1059.3911]95 [ 379.47416, 1347.6031]95
##  [9] [ 806.42970, 1779.9739]95 [ 629.79104, 1606.7589]95
## [11] [ 646.29471, 1625.4308]95 [ 736.21570, 1716.7263]95
## [13] [ 830.41085, 1811.7935]95 [1051.39926, 2033.3353]95
## [15] [1492.96042, 2475.2478]95 [1444.92294, 2427.4334]95
## [17] [ 953.75828, 1936.4103]95 [1221.77825, 2204.5203]95
## [19] [1385.84890, 2368.6480]95 [1299.50244, 2282.3378]95
## [21] [1067.94057, 2050.7990]95 [ 679.40410, 1662.2772]95
## [23] [ 370.22130, 1353.1037]95 [  18.82449, 1001.7128]95

The above intervals are the 95% prediction intervals. The negative lower bounds are not logical, since intensity cannot take a negative value. For instance, from the second interval it can be said with 95% confidence that the traffic intensity at 1 am will be between -182.36791 and 543.5951. Likewise, from the first interval it can be said with 95% confidence that the traffic intensity at hour 0 (midnight) will be between 90.17934 and 614.3613.
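One common remedy for negative lower bounds is to model a log transformation of the series; fable inverts the transformation automatically when forecasting, so the back-transformed intervals are bounded below (a sketch, assuming train_frame_tsbl and n_ahead from above; the +1 guards against zero intensities):

```r
# ARIMA on log(intensity + 1): the back-transformed bounds are
# exp(.) - 1, so the intervals can no longer dip far below zero
model_arima_log <- train_frame_tsbl %>%
  filter(split == 2) %>%
  model(arima_log = ARIMA(log(intensity + 1)))

model_arima_log %>%
  forecast(h = n_ahead) %>%
  hilo(level = 95)
```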

Average Accuracy in splits for each model

fable_frame  %>% fabletools::accuracy(test_frame_tsbl) %>%
  group_by(.model) %>%
  summarise(avg_mae=mean(MAE, na.rm = FALSE),avg_rmse=mean(RMSE,na.rm = FALSE))
## # A tibble: 5 x 3
##   .model       avg_mae avg_rmse
##   <chr>          <dbl>    <dbl>
## 1 arima           176.     221.
## 2 arima_manuel    180.     225.
## 3 ets             228.     280.
## 4 lm              188.     228.
## 5 prophet         193.     243.

After summarizing each model’s test accuracy in terms of MAE and RMSE, it is clear that the auto ARIMA model has the lowest error rate, with the manually defined seasonal ARIMA a close second.

Machine Learning for time series

In this part, four machine learning algorithms (glmnet, random forest, XGBoost and neural networks) will be used to predict the traffic intensity in district 14.

Preprocessing

Before applying any machine learning algorithm, a recipe must first be defined. Machine learning models are more complex than univariate models (such as ARIMA), and this complexity means their implementation generally requires a set of steps. The first step is to create a recipe for preprocessing, which defines how features are engineered from the training series. As a second step, the models are specified together with their workflows and organized with IDs via the modeltime_table() function. Finally, the workflows are fitted to the data and the forecasts are produced.

In the following chunk, all these steps are applied for each split. At the end of the loop, each model’s test accuracy is collected for every split.

# empty data frame 
errors <- data.frame(matrix(ncol = 10, nrow = 0))

for (sp in 1:nrow(split_frame)) {
  
  ########## Preprocessing - creating the recipe 
  
  recipe_spec <- recipe(intensity ~ day, train_frame_tsbl %>% filter(split==sp)) %>%
    step_timeseries_signature(day) %>%
    # step_rm is to remove features we are not going to use
    step_rm(contains("am.pm"), contains("minute"),contains("year"),contains("month"),
            contains("week"), contains("second"), contains("xts"), contains("iso")
           ) %>%
    step_dummy(all_nominal()) %>%
    step_fourier(day, K=1, period = 24) %>%
    step_normalize(day_hour12)

  ########## models and workflows 
  
  # glmnet
  model_spec_glmnet <- linear_reg(penalty = 0.01, mixture = 0.5) %>%
    set_engine("glmnet")
  
  workflow_fit_glmnet <- workflow() %>%
    add_model(model_spec_glmnet) %>%
    add_recipe(recipe_spec %>% step_rm(day)) %>%
    fit(train_frame_tsbl %>% filter(split==sp))
  
  # randomForest
  model_spec_rf <- rand_forest(trees = 200) %>%
    set_engine("randomForest")
  
  workflow_fit_rf <- workflow() %>%
    add_model(model_spec_rf) %>%
    add_recipe(recipe_spec %>% step_rm(day)) %>%
    fit(train_frame_tsbl %>% filter(split==sp))
  
  # XGBoost
  model_spec_xgboost <- boost_tree() %>%
      set_engine("xgboost")
  
  wflw_fit_xgboost <- workflow() %>%
      add_model(model_spec_xgboost) %>%
      add_recipe(recipe_spec %>% step_rm(day)) %>%
      fit(train_frame_tsbl  %>% filter(split==sp))
  
  # NNETAR
  model_spec_nnetar <- nnetar_reg(seasonal_period = 24) %>%
      set_engine("nnetar")
  
  wflw_fit_nnetar <- workflow() %>%
      add_model(model_spec_nnetar) %>%
      add_recipe(recipe_spec) %>%
      fit(train_frame_tsbl  %>% filter(split==sp))
  
  ############## defining the model table
  model_table <- modeltime_table(
                        workflow_fit_glmnet,
                        workflow_fit_rf,
                        wflw_fit_xgboost,
                        wflw_fit_nnetar) 
  
  ############## calibration table
  
  calibration_table <- model_table %>%
                          modeltime_calibrate(test_frame  %>% filter(split==sp))
  
  errors <- rbind(errors, cbind(calibration_table %>% modeltime_accuracy(), split = sp))

}

errors %>%
  arrange(.model_id)
##    .model_id      .model_desc .type       mae      mape      mase     smape
## 1          1           GLMNET  Test 231.40885 72.715378 1.1252439 37.536556
## 2          1           GLMNET  Test 192.35280 42.624121 1.0724476 29.590428
## 3          1           GLMNET  Test 173.13476 45.424144 0.8989953 27.548745
## 4          1           GLMNET  Test 172.93223 23.686182 1.3267298 23.061434
## 5          1           GLMNET  Test 202.97957 32.064433 1.4380748 33.072783
## 6          1           GLMNET  Test 215.96430 76.395879 1.0614197 37.184388
## 7          2     RANDOMFOREST  Test  71.63340 17.329244 0.3483231 14.344288
## 8          2     RANDOMFOREST  Test  51.56563 21.384201 0.2875000 15.000600
## 9          2     RANDOMFOREST  Test  65.06037 12.449286 0.3378234 10.597963
## 10         2     RANDOMFOREST  Test  97.22560 16.494320 0.7459113 14.558848
## 11         2     RANDOMFOREST  Test  56.15521  9.958338 0.3978498  9.533621
## 12         2     RANDOMFOREST  Test  67.72287 19.086719 0.3328438 14.756105
## 13         3          XGBOOST  Test  58.48969  6.207133 0.2844108  7.085307
## 14         3          XGBOOST  Test  46.36161  6.005814 0.2584854  6.039560
## 15         3          XGBOOST  Test  69.40782  7.381530 0.3603973  7.705369
## 16         3          XGBOOST  Test  60.80116  6.993087 0.4664643  6.738997
## 17         3          XGBOOST  Test  49.41044  6.492984 0.3500643  6.748933
## 18         3          XGBOOST  Test  57.07401  9.258590 0.2805069  8.613378
## 19         4 NNAR(1,1,10)[24]  Test  74.22865  8.742027 0.3609427  8.229002
## 20         4 NNAR(1,1,10)[24]  Test  39.69182  5.573578 0.2212986  5.320445
## 21         4 NNAR(1,1,10)[24]  Test  43.47799  5.960915 0.2257577  5.726547
## 22         4 NNAR(1,1,10)[24]  Test  61.30083  7.914343 0.4702978  7.611275
## 23         4 NNAR(1,1,10)[24]  Test  50.06353  8.272888 0.3546914  8.153844
## 24         4 NNAR(1,1,10)[24]  Test  44.03137  9.701279 0.2164051  8.722336
##         rmse       rsq split
## 1  259.86031 0.8894428     1
## 2  219.70028 0.9069804     2
## 3  208.96817 0.9234675     3
## 4  240.86526 0.8263456     4
## 5  247.63813 0.7891885     5
## 6  241.69813 0.8932117     6
## 7   87.77253 0.9873683     1
## 8   68.31983 0.9943054     2
## 9   76.58205 0.9866995     3
## 10 116.52105 0.9452047     4
## 11  68.11896 0.9684356     5
## 12  82.53661 0.9870516     6
## 13 116.83515 0.9732218     1
## 14  61.54356 0.9950747     2
## 15  91.49927 0.9810309     3
## 16  79.56342 0.9581734     4
## 17  65.72573 0.9787280     5
## 18  85.06645 0.9855021     6
## 19  98.42482 0.9814837     1
## 20  53.27243 0.9968212     2
## 21  55.25984 0.9928755     3
## 22  82.97193 0.9630818     4
## 23  62.38520 0.9821516     5
## 24  64.18482 0.9903956     6

Forecast accuracy by each split

# Visualize results
errors %>%
  arrange(.model_id) %>%
  plot_line(
    x = split,
    y = mae,
    facet_var = .type,
    facet_nrow = 1,
    color = .model_desc,
    title = "Evaluation of forecast accuracy by split",
    subtitle = "Mean absolute error (MAE)",
    xlab = "Split"
    )

According to the split-accuracy plot, the glmnet model performed worst among all models, while the NNAR(1,1,10)[24] model is the most stable across splits. The random forest model had the lowest MAE in split 2 and its highest error in split 4. XGBoost behaves similarly to the neural network, with little fluctuation in its error across splits, although it performed worst in split 3.

Visualizing the forecasts

calibration_table %>%
  modeltime_forecast(actual_data = filter(trafficMAD_14_hr, day(day) > 15)) %>% ## to see the forecast clearly, the first 15 days are filtered out
  plot_modeltime_forecast(.interactive = FALSE) + labs(title = "Forecasts",y = "")

errors  %>%
  group_by(.model_desc) %>%
  summarise(avg_mae=mean(mae, na.rm = FALSE),avg_rmse=mean(rmse,na.rm = FALSE)) %>%
  arrange(avg_mae,avg_rmse)
## # A tibble: 4 x 3
##   .model_desc      avg_mae avg_rmse
##   <chr>              <dbl>    <dbl>
## 1 NNAR(1,1,10)[24]    52.1     69.4
## 2 XGBOOST             56.9     83.4
## 3 RANDOMFOREST        68.2     83.3
## 4 GLMNET             198.     236.

Considering the mean evaluation accuracy of each model, the NNAR(1,1,10)[24] model is clearly the winner, followed by XGBoost in second place.
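As a typical final step with modeltime, the calibrated models can be refitted on the full series before producing a forward forecast (a sketch; it assumes the calibration_table and trafficMAD_14_hr objects above, and note that after the loop calibration_table holds the models of the last split only):

```r
# Refit all calibrated models on the complete hourly series
refit_table <- calibration_table %>%
  modeltime_refit(trafficMAD_14_hr)

# Forecast the next 24 hours beyond the observed data
refit_table %>%
  modeltime_forecast(h = "24 hours", actual_data = trafficMAD_14_hr) %>%
  plot_modeltime_forecast(.interactive = FALSE)
```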

Conclusions

In this final project, the open traffic intensity data of Madrid city has been used. This multivariate raw data contained the traffic information of 368 zones of Madrid at a 15-minute frequency. First, to make the time series smoother and easier to analyze, the data was aggregated to an hourly frequency, which greatly reduced the number of observations from 10,485,575. Apart from this dataset, a second dataset was merged in to obtain the location information of each ID. With the district information available, the busiest area of Madrid was chosen for analysis: among the 21 districts, district 14 had the highest intensity, so the hourly time series was filtered to that district.

For the forecasting part with statistical tools, a vector autoregression (VAR) model was first applied to the data. Using the three most interesting variables (occupation, intensity and vmed, the speed of the cars), a VAR model was fitted and used to predict. The model selected by BIC performed best, and the forecasts of occupation and intensity were almost identical because the two series are very similar. Therefore, the remaining statistical models focused on forecasting intensity. After the VAR model, a manual seasonal ARIMA was fitted to the data. Then automatic forecasting models were studied (arima, ets, tslm and prophet), with the manual seasonal ARIMA added to check whether it was the best choice. Except for the 5th split, auto ARIMA chose the same model as the manually defined one. After time series cross-validation, auto ARIMA achieved the lowest evaluation error (MAE: 175.7470, RMSE: 221.0155), with the manual ARIMA performing similarly (MAE: 179.8309, RMSE: 224.7249).

For the machine learning part, the glmnet, random forest, XGBoost and neural network algorithms were studied with time series cross-validation, using the same splits as the statistical tools. The results obtained were considerably better than those of the statistical models; only glmnet did not perform well. The NNAR(1,1,10)[24] (MAE: 49.48689, RMSE: 65.41056) and XGBOOST (MAE: 56.92412, RMSE: 83.37226) models performed very well. As a result, the neural network had the best model performance, so this algorithm could be used to produce the most accurate traffic intensity predictions among all the methods tried.

References